Why Software Pipelining Delivers Better Performance

Why Software Pipelining Delivers Better Performance

To introduce the topic of software pipelining, let's consider the simple DAXPY loop (double precision a times x plus y) shown below.

DO i = 1, n

v(i) = v(i) + X * w(i)

ENDDO

On the R8000 architecture, this can be coded as two load instructions followed by a madd instruction and a store. See Figure 6-1.

Figure 6-1 : A Simple DAXPY Implementation This simplest schedule achieves the desired result after five cycles. Since the R8000 architecture can allow up to two memory operations and two floating point operations in the same cycle, this simple example uses only 1/10th of the R8000's peak megaflops. There is also a delay of three cycles before the results of the madd can be stored.

0: ldc1 ldc1 madd
1:
2:
3:
4: sdc1

A loop unrolling by four schedule improves the performance to 1/4 th of the R8000's peak megaflops.

0: ldc1    ldc1    madd
1: ldc1    ldc1    madd
2: ldc1    ldc1    madd
3: ldc1    ldc1    madd
4: sdc1
5: sdc1
6: sdc1
7: sdc1

But this schedule does not take advantage of the R8000's ability to do two stores in one cycle.The best schedule that could be achieved would look like the following:

0: ldc1    ldc1
1: ldc1    ldc1    madd    madd
2: sdc1    sdc1

It uses 1/3 of the R8000's peak megaflops. But there still is a problem with the madd sdc1 interlock delay. Software pipelining addresses this problem.

Software pipelining allows you to mix operations from different loop iterations in each iteration of the hardware loop. Thus the store instructions (which would interlock) in the above example could be used to store the results of different iterations. This can look like the following:

L1:
1: t1 = ldc1    t2 = ldc1   t3 = madd t1 X t2
2: t4 = ldc1    t5 = ldc1   t6 = madd t4 X t5
3: sdc1 t7      sdc1 t8     beq DONE

4: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
5: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
6: sdc1 t3      sdc1 t6     bne L1

DONE:

The stores in this loop are storing the madd results from previous iterations. But, in general, you could mix any operations from any number of different iterations. Also, note that every loop replication completes two loop iterations in 3 cycles.

In order to properly prepare for entry into such a loop, a windup section of code is added. The windup section sets up registers for the first stores in the main loop. In order to exit the loop properly, a winddown section is added. The winddown section performs the final stores. Any preparation of registers needed for the windown is done in the compensation section. The winddown section also prevents speculative operations.

windup:
1: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
2: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
L1:
1: t1 = ldc1    t2 = ldc1   t3 = madd t1 X t2
2: t4 = ldc1    t5 = ldc1   t6 = madd t4 X t5
3: sdc1 t7      sdc1 t8     beq compensation1

4: t1 = ldc1    t2 = ldc1   t7 = madd t1 X t2
5: t4 = ldc1    t5 = ldc1   t8 = madd t4 X t5
6: sdc1 t3      sdc1 t6     bne L1

winddown:
1: sdc1 t7      sdc1 t8     br ALLDONE

compensation1:
1: t7 = t3      t8 = t6
2: br winddown

ALLDONE:

Our example loop always does loads from at least 4 iterations, so we don't want to start it if we don't want speculative operations and if the trip count is less than 4. The following generalizes our example into a map of a software pipelined loop:

/* Precondition for unroll by 2 */
   do i = 1, n mod 2
      original loop
   enddo

   if ( n-i < 4 ) goto simple_loop
windup:
      ...
/* fp - fence post */
      fp = fp - peel_amount 
      ...

swp replication 0:
      ...
      if ( i == fp ) goto compensation 0   
      ...
swp replication n - 1:
      ...
      if ( i != fp ) goto swp replication 0

compensation n-1:

winddown:
   ...
   goto done

compensation 0:
/* Move registers to set up winddown */
   rx = ry
   ...
   goto winddown
   ...

compensation n - 2: 
   ...
/* Move registers to set up winddown */
   rx = ry
   goto winddown

simple_loop:

   do i = i,n
      original_loop
   enddo

done:

In practice, software pipelining can make a huge difference. Sample benchmark performances see more than 100% improvement when compiled with software pipelining. This makes it clear that in order to get the best performance on the R8000 (-mips4), you should compile your application to use software pipelining (-O3).